Recently, deep neural networks (DNNs) have been used to learn speaker features. However, the quality of the learned features is not sufficiently good, so a complex back-end model, either neural or probabilistic, has to be used to address the residual uncertainty when they are applied to speaker verification, just as with raw features. This paper presents a convolutional time-delay deep neural network structure (CT-DNN) for speaker feature learning. Our experimental results on the Fisher database demonstrate that this CT-DNN can produce high-quality speaker features: even with a single feature (0.3 seconds including the context), the equal error rate (EER) can be as low as 7.68%. This effectively confirms that the speaker trait is largely a deterministic short-time property rather than a long-time distributional pattern, and can therefore be extracted from just dozens of frames.